
[Data] speedup checkpoint filter 5x #60002

Closed
wxwmd wants to merge 1 commit into ray-project:master from wxwmd:speedup_ckpt_filter

Conversation

@wxwmd
Contributor

@wxwmd wxwmd commented Jan 9, 2026

Modification

I'm using Ray Data's checkpoint. My data has 115 million records, with primary key {"id": str}. When I use Checkpoint to filter the input blocks, it takes several hours.

I checked where the performance bottleneck is and found it in the filter_with_ckpt_chunk function in checkpoint_filter.py. I added some timing logs:

# Get all chunks of the checkpointed ID column.
ckpt_chunks = checkpointed_ids[self.id_column].chunks
# Convert the block's ID column to a numpy array for fast processing.
block_ids = block[self.id_column].to_numpy()

def filter_with_ckpt_chunk(ckpt_chunk: pyarrow.ChunkedArray) -> numpy.ndarray:
    t1 = time.time()
    ckpt_ids = transform_pyarrow.to_numpy(ckpt_chunk, zero_copy_only=False)
    print(f"ckpt_ids to numpy cost time {time.time()-t1}s")
   
    ...
    t2 = time.time()
    sorted_indices = numpy.searchsorted(ckpt_ids, block_ids)
    print(f"searchsorted costs {time.time()-t2}s")

The ckpt_chunk has shape (115022113,) and block_ids has shape (14534,). I got:

ckpt_ids to numpy cost time: 6.057122468948364s
searchsorted costs 0.11587834358215332s

We can see from the perf test that:

  1. ckpt_chunks contains only one chunk, because the chunks have already been combined by _combine_chunks.
  2. That single chunk holds 115 million ids, so converting it from pyarrow to numpy takes about 6 s.
  3. ckpt_ids = transform_pyarrow.to_numpy(ckpt_chunk, zero_copy_only=False) is executed once for every input block, so the 6 s conversion is paid again and again, which dominates the filtering time.

This PR obtains the ckpt_ids numpy array in advance, avoiding the repeated conversion. In my tests, this reduces the filtering time from 5 hours to 40 minutes.
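
For reference, a minimal self-contained sketch of the pattern this PR moves to: convert the checkpoint ID column to numpy once and reuse that array for every block. The table construction and mask logic here are illustrative, not the actual checkpoint_filter.py code.

import numpy as np
import pyarrow as pa

# Illustrative data; the real checkpoint is a pyarrow Table with an "id" column.
checkpointed_ids = pa.table({"id": [f"row-{i}" for i in range(100_000)]})

# One-time conversion (previously repeated inside the per-block filter):
combined = checkpointed_ids.combine_chunks()
ckpt_ids = combined["id"].chunk(0).to_numpy(zero_copy_only=False)
ckpt_ids.sort()  # searchsorted requires a sorted haystack

def filter_block(block_ids: np.ndarray) -> np.ndarray:
    """Boolean mask of block rows that are NOT in the checkpoint."""
    idx = np.searchsorted(ckpt_ids, block_ids)
    idx = np.clip(idx, 0, len(ckpt_ids) - 1)
    return ckpt_ids[idx] != block_ids

print(filter_block(np.array(["row-5", "not-in-ckpt"])))  # [False  True]

With this shape, the expensive to_numpy conversion happens once up front, and the only per-block work is the searchsorted lookup, which the logs above show costs roughly 0.1 s.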

Notes:

In this PR, each read task reads the ckpt_ids as a numpy.ndarray from the object store rather than in Arrow format. This increases I/O and memory overhead, because Arrow arrays usually take less space. In my experiment, the pyarrow array (115 million rows, string-typed) used 1.7 GB of memory, while the numpy array used 9 GB. However, I think this memory overhead is acceptable given the performance improvement.
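
The footprint gap can be reproduced roughly as follows; the exact ratio depends on ID length and per-string Python overhead, so this is only an order-of-magnitude illustration rather than the measurement above.

import sys
import pyarrow as pa

n = 1_000_000
ids = [f"{i:032d}" for i in range(n)]  # 32-character string IDs

arrow_col = pa.array(ids, type=pa.string())            # contiguous offset + data buffers
numpy_col = arrow_col.to_numpy(zero_copy_only=False)   # object array of Python str

arrow_mb = arrow_col.nbytes / 1e6
# The object array stores an 8-byte pointer per row plus a separate Python str object.
numpy_mb = (numpy_col.nbytes + sum(sys.getsizeof(s) for s in numpy_col)) / 1e6

print(f"arrow: {arrow_mb:.0f} MB, numpy object array: {numpy_mb:.0f} MB")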

@wxwmd wxwmd requested a review from a team as a code owner January 9, 2026 09:15

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance optimization for checkpoint filtering by converting checkpointed IDs to a NumPy array once, rather than for every block. The changes are well-implemented and consistent across the modified files. My review includes a couple of suggestions to enhance code clarity and maintainability.

Comment on lines 689 to 695
combined_ckpt_block = transform_pyarrow.combine_chunks(pyarrow_checkpointed_ids)

combine_ckpt_chunks = combined_ckpt_block[ID_COL].chunks
assert len(combine_ckpt_chunks) == 1
# Convert checkpoint chunk to numpy for fast search.
# Use internal helper function for consistency and robustness (handles null-typed arrays, etc.)
ckpt_ids = transform_pyarrow.to_numpy(combine_ckpt_chunks[0], zero_copy_only=False)

Severity: medium

This logic for converting a pyarrow Table to a numpy array of IDs is duplicated from _combine_chunks in checkpoint_filter.py. To improve maintainability, consider extracting this logic into a non-remote helper function in checkpoint_filter.py and calling it from both _combine_chunks and this test. This would avoid having to update the logic in two places if it ever changes.
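
A possible shape for such a shared helper, assuming transform_pyarrow refers to ray.data._internal.arrow_ops.transform_pyarrow as in the snippets above; the function name and placement are suggestions, not an existing Ray Data API.

import numpy as np
import pyarrow

# Assumed import path for the helper module used in the snippets above.
from ray.data._internal.arrow_ops import transform_pyarrow


def checkpointed_ids_to_numpy(checkpointed_ids: pyarrow.Table, id_column: str) -> np.ndarray:
    """Combine the checkpoint ID column into a single chunk and convert it to numpy."""
    combined = transform_pyarrow.combine_chunks(checkpointed_ids)
    chunks = combined[id_column].chunks
    assert len(chunks) == 1, "combine_chunks should produce exactly one chunk"
    return transform_pyarrow.to_numpy(chunks[0], zero_copy_only=False)

Both _combine_chunks in checkpoint_filter.py and the test could then call this single function.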

@wxwmd wxwmd changed the title from "[Data] speedup ckpt filter 5x" to "[Data] speedup checkpoint filter 5x" Jan 9, 2026
@ray-gardener ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Jan 9, 2026
@owenowenisme owenowenisme self-assigned this Jan 10, 2026
@wingkitlee0
Contributor

This is nice. Some optimizations can be considered for future PRs:

  • It may be worth sorting the block_ids when performing searchsorted(checkpoint_ids, block_ids); there are some numpy-internal optimizations that benefit from sorted queries. We may want to restore the original order for the output, though.
  • The industry-standard sortedcontainers library uses a list of lists (i.e., chunking). We may be able to do something similar: chunk the long array into multiple shorter ones (<1M elements) so that each one fits in cache individually.
  • Related to the second point, partitioning may help avoid repartition(1) when loading the checkpoint (I haven't read through how the checkpoint is constructed yet, but repartition(1) seems heavy if the pipeline has almost finished).

"filtering time from 5 hours to 40 minutes."

Just to understand better, this is the total time spent in the filter function, right?

@wxwmd
Contributor Author

wxwmd commented Jan 12, 2026

"Just to understand better, this is the total time spent in the filter function, right?"

@wingkitlee0 yes, this is the total time spent in the filter function. This PR addresses the time overhead caused by repeated pyarrow->numpy copies. After this, I believe the points you mentioned can further improve performance, and I'm interested in implementing them.
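
A rough sketch of the first suggestion (sort block_ids before the lookup, then scatter the mask back to the original row order); whether this helps in practice depends on numpy internals and the data, so treat it as an untested idea.

import numpy as np

def filter_mask_sorted(ckpt_ids: np.ndarray, block_ids: np.ndarray) -> np.ndarray:
    """Boolean mask of block rows not present in ckpt_ids (ckpt_ids must be sorted)."""
    order = np.argsort(block_ids, kind="stable")
    sorted_ids = block_ids[order]

    # With sorted needles, searchsorted visits ckpt_ids at non-decreasing positions.
    idx = np.searchsorted(ckpt_ids, sorted_ids)
    idx = np.clip(idx, 0, len(ckpt_ids) - 1)
    sorted_mask = ckpt_ids[idx] != sorted_ids

    # Scatter the mask back to the block's original row order.
    mask = np.empty_like(sorted_mask)
    mask[order] = sorted_mask
    return mask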

@wxwmd wxwmd force-pushed the speedup_ckpt_filter branch from c6d23db to 0fb1a5d on January 12, 2026 07:17
@wxwmd wxwmd force-pushed the speedup_ckpt_filter branch from 8282751 to 8be95cf on January 12, 2026 12:14
@wxwmd
Contributor Author

wxwmd commented Jan 13, 2026

Seems kind of messy; I will split this into 3 PRs.

@wxwmd wxwmd force-pushed the speedup_ckpt_filter branch 2 times, most recently from c4eab5f to 0e213dc on January 13, 2026 11:05
@wxwmd wxwmd marked this pull request as draft January 14, 2026 07:35
Signed-off-by: xiaowen.wxw <wxw403883@alibaba-inc.com>

keep this pr simple
Signed-off-by: xiaowen.wxw <wxw403883@alibaba-inc.com>
@wxwmd
Contributor Author

wxwmd commented Jan 19, 2026

moved to #60294
